Upper and lower bounds for the typical storage capacity of a constructive algorithm, the Tilinglike
Learning Algorithm for the Parity Machine [M. Biehl and M. Opper, Phys. Rev. A 44 6888 (1991)], 
are determined in the asymptotic limit of large training set sizes. The properties of a perceptron 
with threshold, learning a training set of patterns having a biased distribution of targets, needed 
as an intermediate step in the capacity calculation, are determined analytically. The lower bound 
for the capacity, determined with a cavity method, is proportional to the number of hidden units. 
The upper bound, obtained with the hypothesis of replica symmetry, is close to the one predicted 
by Mitchinson and Durbin [Biol. Cyber. 60 345 (1989)]. 
I. INTRODUCTION



In this paper, we consider the problem of learning binary classification tasks from examples with neural networks.
The network's architecture and the neurons' weights are determined from a training set of examples or patterns
$\mathcal{L}_\alpha$, composed of $P = \alpha N$ input vectors $\{\mathbf{x}^\mu\}_{\mu=1,\dots,P}$ in $N$-dimensional space and their corresponding classes $\tau^\mu = \pm 1$.
The latter are the targets to be learned. Hereafter, we call $\alpha = P/N$ the size of the training set. One interesting
property that characterizes a neural network is its storage capacity, which is the size $\alpha_c$ of the largest training set
with arbitrary targets that the network is able to learn (with probability 1). The perceptron, a single neuron connected
to its inputs through $N$ weights, performs linear separations and has a storage capacity $\alpha_c = 2$ [1]-[?]. It is possible
to increase the storage capacity of neural networks by considering more complicated architectures, like those with
one hidden layer of $k$ units. Such monolayer perceptrons map each input vector $\mathbf{x}$ to a binary $k$-dimensional internal
representation determined by the outputs of $k$ perceptrons, which in this context are also called hidden units. The
overall network's output to an input pattern is a boolean function of the corresponding internal representation. This
function may be learned by an output perceptron, but then the internal representations of the training set must be
linearly separable. In order to get rid of this constraint, networks implementing particular functions of the hidden
states have been investigated. Among these, the committee machine, whose output is the class of the majority of the
hidden units, and the parity machine, whose output is the product of the $k$ components of the internal representation,
have received particular attention [?].
Learning consists of adapting the number of hidden perceptrons and their weights so that the outputs of the
network to the training examples match the corresponding targets. The main problem is that the internal representations
are unknown. Besides the CHIR algorithm [?], which determines the internal representations through a random
process involving learning faithful sets of internal representations with $k$ fixed, most learning algorithms build the
internal representations through a deterministic incremental procedure that determines $k$ by construction. In the latter
case, the hidden perceptrons are trained one after the other, with targets that differ from one algorithm to another,
until the correct classification is achieved. The first incremental procedure was proposed by Gallant [?]. Many
other authors developed this idea further, like Mezard and Nadal with the Tiling Algorithm [?], Rujan and Marchand
with the Sequential Learning Algorithm [?] and Biehl and Opper with the Tilinglike Learning Algorithm [?]. Other
variations have been proposed [?]. It has been argued that these incremental procedures may require a number of
hidden units much larger than the number actually needed by a network making use of its full storage capacity. In the
following we therefore distinguish the algorithm's capacity, defined as the size of the largest training set (with arbitrary
targets) learnable with the algorithm, from the capacity of the network with the same architecture. Clearly the
former cannot be larger than the latter. An upper bound for the storage capacity of the parity machine with $k$ hidden
perceptrons has been obtained by Mitchinson and Durbin [13] through a geometric approach: $\alpha_c(k) < k \ln k / \ln 2$.
Recent replica calculation results, obtained in the limit of a large number of hidden perceptrons ($k \to +\infty$) [14],
strongly suggest that this upper bound may effectively be reached. However, the learning problem remains: is there
a learning algorithm whose capacity saturates this bound? This question was addressed recently in [15] within the
same statistical mechanics framework as the present work. In spite of a thorough analysis, no clear-cut conclusion
could be drawn in the asymptotic regime of large $k$, because of a lack of precision in the numerical integration of the
corresponding equations.
In this paper, we determine analytically the storage capacity of a parity machine built with the Tilinglike Learning
Algorithm (TLA). Our results present strong evidence that the storage capacity of the obtained network is
close to the upper bound, at least within the replica-symmetry approximation. The paper is organized as follows:
in section II, we describe the TLA. The conditions necessary for the TLA to converge impose strong constraints on
the cost function used to train the hidden perceptrons. These are discussed in section III. Despite intensive research
in this field, no analytic results exist on the learning properties of the perceptron with threshold in the asymptotic limit
$\alpha \to +\infty$ needed here. These are deduced in section IV for the Gardner cost function with vanishing and finite
margin, within the replica-symmetry (RS) approximation. As this approximation is known to provide only a lower
bound to the perceptron's actual training error [16,17], we also determine an upper bound through a generalization
of the Kuhn-Tucker (KT) cavity method proposed by Gerl and Krey [18]. The general expression for the number of
hidden perceptrons generated by the TLA in the limit $\alpha \to +\infty$ is deduced in section V. Our main result is that the
number of hidden units needed by the TLA to converge grows proportionally to $\alpha/(\ln\alpha)^{\nu}$ in the large $\alpha$ limit, where
$\nu = 1$ in the RS approximation and $\nu = 0$ within the KT cavity method, provided that the hidden perceptrons learn
through the minimization of their training errors. Our results are discussed and compared both to the Mitchinson
and Durbin bound [13] and to the numerical results obtained by West and Saad. The general conclusion is left
to section VI.



II. THE TILINGLIKE LEARNING ALGORITHM (TLA) 

We now describe the Tilinglike Learning Algorithm (TLA), which we consider here because of its
simplicity. The TLA needs hidden perceptrons with a threshold to generate the parity machine. The classification
performed by a perceptron is a linear separation defined by a hyperplane in the $N$-dimensional input space, of normal
vector $\mathbf{J}$ ($\mathbf{J}\cdot\mathbf{J} = 1$) and distance to the origin $\theta$. The $N$ components of $\mathbf{J}$ are the perceptron's weights and $\theta$ is its
threshold. An example $\mathbf{x}$ is classified as follows:

$$\sigma = \mathrm{sign}(\mathbf{J}\cdot\mathbf{x} - \theta). \tag{1}$$

As already pointed out in [15], the threshold is useful in the case of unbalanced training sets, containing more
examples of one class than of the other. As we will see in the following, this is the case for the successive perceptrons
included by the TLA.

In the first learning step of the algorithm, the parameters $\mathbf{J}_1$ and $\theta_1$ of a perceptron are adapted in order to obtain
the lowest possible number of training errors. This is usually done through the minimization of a cost function:

$$E(\mathbf{J}_1,\theta_1;\mathcal{L}_\alpha) = \sum_{\mu=1}^{P} V(\lambda_1^\mu) \tag{2}$$

where the potential $V$ is a function of $\lambda_1^\mu$, the stability of the example $\mu$:

$$\lambda_1^\mu = \tau^\mu\,(\mathbf{J}_1\cdot\mathbf{x}^\mu - \theta_1). \tag{3}$$

The stability is positive if and only if the example is correctly classified. Its absolute value is the distance of the
example to the separating hyperplane.
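As a small illustration of the stability defined in equation (3), the sketch below evaluates the stabilities and the resulting fraction of training errors for a perceptron with threshold. The function names and the toy data are hypothetical, and the weight vector is rescaled to unit length inside the function so that the threshold keeps its meaning as a distance to the origin.

```python
import math

def stabilities(J, theta, X, tau):
    """Return tau^mu * (J.x^mu - theta) for every example, with J rescaled to unit length."""
    norm = math.sqrt(sum(j * j for j in J))
    unit = [j / norm for j in J]
    return [t * (sum(u * xi for u, xi in zip(unit, x)) - theta)
            for x, t in zip(X, tau)]

def training_error(J, theta, X, tau):
    """Fraction of examples with non-positive stability, i.e. not correctly classified."""
    lam = stabilities(J, theta, X, tau)
    return sum(1 for value in lam if value <= 0) / len(lam)

# Toy data (assumed for illustration): three examples in two dimensions.
X = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tau = [1, 1, -1]
print(stabilities([1.0, 0.0], 0.5, X, tau))
print(training_error([1.0, 0.0], 0.5, X, tau))
```

In this toy case the second example lies on the wrong side of the hyperplane, so one example out of three is counted as a training error.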

In principle, there is some freedom in the choice of the potential $V(\lambda)$. As it has to penalize training errors, it
has to be a decreasing function of $\lambda$. Taking the number of training errors as the cost function corresponds to the
particular choice $V(\lambda) = \Theta(-\lambda)$, where $\Theta(x)$ is the Heaviside function. Other potentials, which do not minimize the
number of training errors but possess interesting learning or algorithmic properties, may be chosen. Examples are
$V(\lambda) = (\kappa - \lambda)^n\,\Theta(\kappa - \lambda)$, where $\kappa > 0$ is a fixed positive margin chosen a priori. The case $n = 0$ corresponds to
the so-called Gardner potential [?], which reduces to the error counting function for $\kappa = 0$. The potential defined by
$n = 1$ corresponds to the Perceptron learning algorithm [19]-[21] and $n = 2$ to the AdaTron.
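The family of potentials just described can be written compactly in code. The following is a minimal sketch, with the hypothetical function name `potential` and the convention, assumed here, that the step function vanishes when the stability equals the margin.

```python
def potential(lam, kappa=0.0, n=0):
    """(kappa - lam)**n for lam < kappa, and 0 otherwise (step function taken as 0 at lam == kappa)."""
    if lam >= kappa:
        return 0.0
    return float((kappa - lam) ** n)

def cost(stabilities, kappa=0.0, n=0):
    """Cost function: sum of the potential over the stabilities of all examples."""
    return sum(potential(lam, kappa, n) for lam in stabilities)

lams = [-1.0, 0.5, 2.0]
print(cost(lams, kappa=0.0, n=0))   # counts training errors
print(cost(lams, kappa=1.0, n=0))   # counts stabilities below the margin (Gardner)
print(cost(lams, kappa=1.0, n=1))   # Perceptron-type potential
print(cost(lams, kappa=1.0, n=2))   # AdaTron-type potential
```

Only the cases n = 0 stay bounded when the stability goes to minus infinity, which is the property that matters for the convergence discussion later in the paper.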

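After learning, the training error of the first perceptron is:

$$\epsilon_t^1(\mathbf{J}_1^*,\theta_1^*;\mathcal{L}_\alpha) = \frac{1}{P}\sum_{\mu=1}^{P}\Theta(-\tau^\mu\sigma_1^\mu) \tag{4}$$

where $\sigma_1^\mu$, the class given by the perceptron to the example $\mu$, depends through equation (1) on the parameters $\mathbf{J}_1^*$
and $\theta_1^*$ that minimize the cost function.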
If the training error is zero, the learning procedure stops. Then, the class associated to the patterns by the parity
machine is just the class given by the first perceptron. Otherwise, another perceptron is included and trained with
the aim of separating the correctly learned examples from the wrongly learned ones. The corresponding training set
$\mathcal{L}_\alpha^2 = \{\mathbf{x}^\mu,\tau_2^\mu\}_{\mu=1,\dots,P}$ contains the same input examples as $\mathcal{L}_\alpha$ with new targets defined as follows: $\tau_2^\mu = +1$ if
the example $\mathbf{x}^\mu$ is correctly classified by the previous perceptron and $\tau_2^\mu = -1$ if not. These targets may be expressed
as $\tau_2^\mu = \sigma_1^\mu\tau^\mu$. Notice that a fraction $1 - \epsilon_t^1$ of patterns have targets $+1$, and a fraction $\epsilon_t^1$ have targets $-1$. Since we
expect the training error $\epsilon_t^1$ to be smaller than 1/2, the probability of targets $-1$ is smaller than that of targets $+1$.
The successive perceptrons need a threshold to learn such biased training sets. Otherwise, the tilinglike construction
cannot converge.

The parameters $\mathbf{J}_2$ and $\theta_2$ of the second perceptron are learned with the training set $\mathcal{L}_\alpha^2$, minimizing the
same cost function as the first one. The same procedure, in which the perceptron $i+1$ learns the training set
$\mathcal{L}_\alpha^{i+1} = \{\mathbf{x}^\mu, \tau_{i+1}^\mu = \sigma_i^\mu\tau_i^\mu\}_{\mu=1,\dots,P}$, has to be iterated until $\epsilon_t^k = 0$. Then, the product $\sigma^\mu$ of the classes $\{\sigma_i^\mu\}_{i=1,\dots,k}$
given by the hidden perceptrons to an example $\mathbf{x}^\mu$ corresponds to the target [10,12], as
$\sigma^\mu = \sigma_1^\mu\cdots\sigma_k^\mu = \sigma_1^\mu\cdots\sigma_{k-2}^\mu(\sigma_{k-1}^\mu)^2\tau_{k-1}^\mu = \sigma_1^\mu\cdots\sigma_{k-2}^\mu\tau_{k-1}^\mu = \cdots = \tau^\mu$. Thus, the TLA constructs a parity machine with $k$ hidden
units.
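The construction described above can be sketched in a few lines. The unit trainer used here is a simple pocket perceptron, which is an assumption made only for illustration: the paper trains each unit by minimizing a cost function, and nothing guarantees that this stand-in reaches the minimum number of errors or that the loop terminates before `max_units`. All function names are hypothetical.

```python
def classify(J, theta, x):
    """Output of a perceptron with threshold: sign(J.x - theta), with sign(0) taken as +1."""
    return 1 if sum(j * xi for j, xi in zip(J, x)) - theta >= 0 else -1

def count_errors(J, theta, X, targets):
    return sum(1 for x, t in zip(X, targets) if classify(J, theta, x) != t)

def train_unit(X, targets, epochs=100):
    """Stand-in trainer (pocket perceptron): returns the best weights and threshold seen."""
    J, theta = [0.0] * len(X[0]), 0.0
    best = (count_errors(J, theta, X, targets), J[:], theta)
    for _ in range(epochs):
        for x, t in zip(X, targets):
            if classify(J, theta, x) != t:
                J = [j + t * xi for j, xi in zip(J, x)]
                theta -= t
                errors = count_errors(J, theta, X, targets)
                if errors < best[0]:
                    best = (errors, J[:], theta)
    return best[1], best[2]

def tla(X, tau, max_units=20):
    """Add hidden units one by one; return the units and the residual targets."""
    units, targets = [], list(tau)
    for _ in range(max_units):
        J, theta = train_unit(X, targets)
        units.append((J, theta))
        # New targets: +1 where this unit is right, -1 where it is wrong.
        targets = [classify(J, theta, x) * t for x, t in zip(X, targets)]
        if all(t == 1 for t in targets):
            break
    return units, targets

def parity_output(units, X):
    """Product of the hidden-unit outputs for each example."""
    outputs = []
    for x in X:
        product = 1
        for J, theta in units:
            product *= classify(J, theta, x)
        outputs.append(product)
    return outputs
```

Whatever trainer is used, the product of the hidden outputs times the original target equals the residual target of the last step, so the parity machine reproduces the training set exactly when all residual targets are +1.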



III. CONVERGENCE CONDITIONS 

It has been shown that if the examples are binary [10], or real-valued vectors in general position [23], there is a
solution that satisfies the TLA construction with the property that $P\epsilon_t^i$ is a succession of decreasing integer numbers.
Thus, a finite $k < P$ exists for which $\epsilon_t^k = 0$.

In the following, we are interested in the typical number $k$ of hidden perceptrons necessary for the TLA to learn
a training set of size $\alpha$. This is obtained in the thermodynamic limit where $N$ and $P$ diverge keeping $\alpha = P/N$
constant. In this limit, $k$ is expected to be independent of the particular set of training patterns, and to depend only
on $\alpha$. However, as $P \to +\infty$, it is not possible to argue that $P\epsilon_t^i$ is a succession of strictly decreasing numbers in order
to guarantee the convergence of the TLA in a finite number of steps (i.e. of hidden units). In particular, the solution
in which a single example is correctly learned at each step, used by the convergence proofs [10,23] at finite $N$, leads
to $k \to +\infty$. In order to obtain a finite number $k(\alpha)$ in the thermodynamic limit, each perceptron has to learn at
least a number of examples of the order of $N$. This imposes some general conditions on the learning algorithm used
to train the perceptrons.

It is worth pointing out that the conditions for convergence with finite $k$ in the thermodynamic limit do not guarantee
convergence for all possible training sets of size $\alpha$. This is due to the probabilistic nature of the statistical
physics results, which predict the average behaviour. The results may not be correct for subsets of zero measure in
the space of training sets, and in particular for the worst case.

As described before, the training set $\mathcal{L}_\alpha^i$ used to train the perceptron $i$ contains a fraction $1 - \epsilon_t^{i-1}$ of patterns
with targets $+1$, and a fraction $\epsilon_t^{i-1}$ of patterns with targets $-1$. These targets are slightly correlated, as they are
determined by the training errors of the preceding perceptron. However, it has been shown that these correlations
are weak [15]. We neglect them in the limit $\alpha \to +\infty$ considered in the following. Thus, we consider that the targets
to be learned by the successive perceptrons are i.i.d. random variables, with probability $1 - \epsilon_t^{i-1}$ of being $+1$ and
$\epsilon_t^{i-1}$ of being $-1$. As this neglects the constraints imposed by the correlations on the minimization of the training error,
we expect that the assumption of uncorrelated targets underestimates the perceptrons' training errors. It follows that
our estimation of the number $k(\alpha)$ of perceptrons necessary to construct the parity machine is a lower bound to the
actual value.

Consider a perceptron learning a training set of size $\alpha$ with targets given by the following biased probability
distribution:

$$P(\tau) = (1-\epsilon)\,\delta(\tau-1) + \epsilon\,\delta(\tau+1). \tag{5}$$

If $\epsilon_t(\alpha,\epsilon)$ is the perceptron's training error, i.e. the fraction of wrongly learned examples, there is a simple relationship
between the training errors $\epsilon_t^{i-1}$ and $\epsilon_t^i$ of two successive hidden perceptrons:

$$\epsilon_t^i = \epsilon_t(\alpha,\epsilon_t^{i-1}) \tag{6}$$

since the bias in the probability of the targets of perceptron $i$ is due to the training error of the preceding unit.

The successive training errors $\epsilon_t^i$ must decrease monotonically with $i$ and eventually vanish for a finite $k$. Otherwise
the TLA does not converge. Taking equation (6) into account, this imposes that:

$$\epsilon_t(\alpha,\epsilon) < \epsilon. \tag{7}$$



Condition (7) restricts the possible potentials in the cost function (2). For example, in the following section we show
that the Perceptron and the AdaTron potentials [19]-[22] do not satisfy the condition (7) for all $\alpha$ when $\epsilon < 1/2$.
The stopping condition of the TLA imposes that there is a finite value of $k$ such that:

$$\epsilon_t^k = \epsilon_t(\alpha,\epsilon_t^{k-1}) = 0. \tag{8}$$

This in turn imposes that for all $\alpha$, there always exists $\epsilon_0(\alpha) \neq 0$ such that $\epsilon_t(\alpha,\epsilon_0(\alpha)) = 0$. Thus, the stopping
condition (8) imposes that the inverse function $\alpha_0(\epsilon)$ diverges as $\epsilon \to 0$. In fact, $\alpha_0(\epsilon)$ is the storage capacity of
a perceptron learning targets drawn with the biased probability (5) (in the literature, the bias is usually defined
as $1 - 2\epsilon$). Actually, the divergence of $\alpha_0(\epsilon)$ occurs whenever the potential $V(\lambda)$ vanishes for $\lambda > 0$ and is strictly
positive for $\lambda < 0$. This is the case for the Gardner potential with $\kappa = 0$, for which $\alpha_0(\epsilon) \sim -(\epsilon\ln\epsilon)^{-1}$ [10]. However,
even though the perceptron has been extensively studied, very few results exist for the case of training sets with biased
distributions of targets [?], [10], [24]. In particular, the asymptotic behaviour of the learning curves $\epsilon_t(\alpha,\epsilon)$ as a function
of $\alpha$ is unknown. These are deduced in the next section. The reader not interested in these intermediate calculations
may skip them and go straight to section V. Only the results displayed by equations (24), (30), (45) and (48) are
used to determine the asymptotic behaviour of the TLA.



IV. PERCEPTRON'S TRAINING ERROR FOR BIASED TARGET-DISTRIBUTIONS 

In order to learn such training sets with biased distributions of targets, the perceptron must have a threshold, as
the separating hyperplanes that minimize the training error do not contain the origin. Here we present new analytic
results, mainly in the asymptotic regime $\alpha \to +\infty$, for the Gardner cost function defined by the potential:

$$V(\lambda) = \Theta(\kappa - \lambda). \tag{9}$$

For $\kappa = 0$, the corresponding cost function is the number of training errors. For $\kappa > 0$, the cost function is the number
of examples with stability (3) smaller than $\kappa$.

The section is divided into two parts. In the first one we derive results within the Replica-Symmetry (RS)
approximation, which is known to underestimate the training error. In the second part we obtain upper bounds for the
training error, using a cavity method.



A. Replica calculation 

We briefly recall the main steps of the replica calculation, which follows the same lines as [10,24]. As we are interested
in the properties of the minimum of the cost function, a temperature $T = 1/\beta$ is introduced and the cost function is
considered as an energy. The corresponding partition function reads:

$$Z(\beta,\mathcal{L}_\alpha(\epsilon)) = \int d\theta\,P(\theta)\int d\mathbf{J}\,P(\mathbf{J})\,\exp\left(-\beta E(\mathbf{J},\theta;\mathcal{L}_\alpha(\epsilon))\right) \tag{10}$$

where the components of $\mathbf{J}$ are the weights, and $\theta$ is the perceptron's threshold. $\mathcal{L}_\alpha(\epsilon)$ is a training set of size $\alpha$. The
input vectors $\mathbf{x}^\mu$ are drawn from a gaussian distribution with zero mean and unit variance in all directions. The
targets have the biased distribution (5).

Following Gardner's approach, the patterns of the training set are considered as frozen disordered variables. The
replica trick allows us to calculate the mean free energy in the thermodynamic limit ($N \to +\infty$, $P \to +\infty$ and $\alpha$
constant), averaged over all possible training sets, as follows:

$$f(\alpha,\epsilon) = \lim_{\beta\to+\infty}\ \lim_{\substack{N\to+\infty\\ \alpha=P/N}}\ \lim_{n\to 0}\ -\frac{1}{\beta n N}\,\ln\overline{Z^n(\beta,\mathcal{L}_\alpha(\epsilon))} \tag{11}$$

where the bar stands for the mean over the training sets with the same size $\alpha$. Thus, the free energy is obtained through
the averaging of a partition function of $n$ replicas of the original system. Hereafter we assume replica symmetry (RS),
i.e. that the replicas are equivalent under permutation. However, it is well known that replica symmetry breaks down
when the training error is finite [?]. Calculations including one step of replica symmetry breaking have shown that
the training error obtained within the RS approximation is a lower bound for the actual one.

Assuming that the weights have a uniform prior probability over the surface of the $N$-dimensional sphere of unit
radius, and the threshold a uniform distribution over the real axis between $-\sqrt{N}$ and $+\sqrt{N}$, the free energy within
the RS approximation reads:

$$f(\alpha,\epsilon) = \max_{c}\,\min_{\theta}\ g(\alpha,\epsilon,c,\theta) \tag{12}$$

where the function $g$ is:

$$g(\alpha,\epsilon,c,\theta) = -\frac{1}{2c} + \alpha(1-\epsilon)\int_{-\infty}^{+\infty} W(\lambda(y,c),y,c)\,\exp\left(-\frac{(y+\theta)^2}{2}\right)\frac{dy}{\sqrt{2\pi}} + \alpha\epsilon\int_{-\infty}^{+\infty} W(\lambda(y,c),y,c)\,\exp\left(-\frac{(y-\theta)^2}{2}\right)\frac{dy}{\sqrt{2\pi}} \tag{13}$$

with $\lambda(y,c)$ the function that minimizes $W(\lambda,y,c) = V(\lambda) + (\lambda-y)^2/2c$. $c$ is the usual order parameter in replica
calculations ($c = \lim_{\beta\to+\infty}\beta(1 - \mathbf{J}_a\cdot\mathbf{J}_b)$ with $\mathbf{J}_a$ and $\mathbf{J}_b$ the directions corresponding to two different replicas). The
parameters $c$ and $\theta$ are solutions of the following extremum conditions:

$$\frac{\partial g}{\partial c} = \frac{\partial g}{\partial\theta} = 0. \tag{14}$$

The training error $\epsilon_t(\alpha,\epsilon)$ may be easily deduced by integration of the distribution of stabilities over the negative
values [?], yielding:

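$$\epsilon_t(\alpha,\epsilon) = (1-\epsilon)\int_{-\infty}^{+\infty}\Theta(-\lambda(y,c))\,\exp\left(-\frac{(y+\theta)^2}{2}\right)\frac{dy}{\sqrt{2\pi}} + \epsilon\int_{-\infty}^{+\infty}\Theta(-\lambda(y,c))\,\exp\left(-\frac{(y-\theta)^2}{2}\right)\frac{dy}{\sqrt{2\pi}}. \tag{15}$$

Equations (12) to (15) are valid for any potential $V(\lambda)$ in (2). In the following, we concentrate specifically on the
Gardner potential (9). The function $\lambda(y,c)$ that minimizes $W(\lambda,y,c)$ for a given $\kappa$ is:

$$\lambda(y,c) = \begin{cases} y & \text{for } y < \kappa - \sqrt{2c} \\ \kappa & \text{for } \kappa - \sqrt{2c} < y < \kappa \\ y & \text{for } \kappa < y \end{cases} \tag{16}$$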
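The three-branch minimizer of equation (16) can be checked numerically against a brute-force search. The sketch below is only a sanity check under stated assumptions: the function names are hypothetical, and the step function of the Gardner potential is taken to vanish when the stability equals the margin, so that the branch with stability equal to the margin costs nothing.

```python
import math

def gardner_lambda(y, c, kappa):
    """Minimizer of W for the Gardner potential: jump to kappa inside the band, stay at y outside."""
    if kappa - math.sqrt(2.0 * c) < y < kappa:
        return kappa
    return y

def W(lam, y, c, kappa):
    """W = V(lam) + (lam - y)^2 / (2c), with V = 1 for lam < kappa and 0 otherwise."""
    V = 1.0 if lam < kappa else 0.0
    return V + (lam - y) ** 2 / (2.0 * c)

# Compare with a grid search for a few values of y (kappa = 1, c = 0.5, so sqrt(2c) = 1).
grid = [-5.0 + 0.001 * i for i in range(10001)]
for y in (-2.0, 0.2, 0.9, 1.5):
    analytic = W(gardner_lambda(y, 0.5, 1.0), y, 0.5, 1.0)
    brute = min(W(lam, y, 0.5, 1.0) for lam in grid)
    print(y, analytic, brute)
```

Inside the band, moving the stability up to the margin costs less than one unit, which is why the minimizer jumps there; below the band the quadratic penalty exceeds the gain and the stability stays at y.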

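Introducing (16) into (13), we deduce $g(\alpha,\epsilon,c,\theta)$. The conditions (14) allow us to determine the equations for $c$ and $\theta$:

$$\frac{1}{\alpha} = (1-\epsilon)\int_{\kappa-\sqrt{2c}+\theta}^{\kappa+\theta}(\kappa+\theta-y)^2\,Dy + \epsilon\int_{\kappa-\sqrt{2c}-\theta}^{\kappa-\theta}(\kappa-\theta-y)^2\,Dy, \tag{17}$$

$$0 = (1-\epsilon)\int_{\kappa-\sqrt{2c}+\theta}^{\kappa+\theta}(\kappa+\theta-y)\,Dy - \epsilon\int_{\kappa-\sqrt{2c}-\theta}^{\kappa-\theta}(\kappa-\theta-y)\,Dy, \tag{18}$$

where $Dy = \exp(-y^2/2)\,dy/\sqrt{2\pi}$. The distribution of stabilities of the training patterns is $\rho(\lambda) = (1-\epsilon)\rho_+(\lambda) + \epsilon\rho_-(\lambda)$
with:

$$\rho_\pm(\lambda) = \delta(\lambda-\kappa)\int_{\kappa-\sqrt{2c}\pm\theta}^{\kappa\pm\theta}Dy + \left\{\Theta(\kappa-\sqrt{2c}-\lambda) + \Theta(\lambda-\kappa)\right\}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(\lambda\pm\theta)^2}{2}\right). \tag{19}$$

$\rho(\lambda)$ presents a two-band structure with a gap between $\lambda_- = \kappa - \sqrt{2c}$ and $\lambda_+ = \kappa$. Notice that the lower band
corresponds to wrongly classified patterns only if $\lambda_- < 0$. If $\kappa > 0$, then $\lambda_-$ may become positive for sufficiently small
values of $c$. In that case, the training error is only a fraction of the patterns lying in the lower band. Taking this into
account, the training error $\epsilon_t(\alpha,\epsilon)$ is:

$$\epsilon_t(\alpha,\epsilon) = (1-\epsilon)\int_{\max[0,\sqrt{2c}-\kappa]-\theta}^{+\infty}Dy + \epsilon\int_{\max[0,\sqrt{2c}-\kappa]+\theta}^{+\infty}Dy. \tag{20}$$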
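For given values of the order parameter and the threshold, the training error is a sum of two gaussian tail integrals, one per target class. The sketch below evaluates it with the complementary error function; it assumes the form of the integration limits that follows from the two-band structure described above (lower limits equal to max(0, sqrt(2c) - kappa) shifted by minus and plus the threshold), and the function names are hypothetical.

```python
import math

def gaussian_tail(x):
    """Integral of the normalized gaussian measure Dy from x to +infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def rs_training_error(c, theta, kappa, eps):
    """Fraction of patterns with negative stability, for targets +1 (weight 1-eps) and -1 (weight eps)."""
    lower = max(0.0, math.sqrt(2.0 * c) - kappa)
    return (1.0 - eps) * gaussian_tail(lower - theta) + eps * gaussian_tail(lower + theta)

# With no threshold and no gap, half of the patterns are misclassified whatever the bias.
print(rs_training_error(c=0.0, theta=0.0, kappa=0.0, eps=0.3))
# With a very negative threshold every pattern is classified +1, so the error equals the bias.
print(rs_training_error(c=0.0, theta=-10.0, kappa=0.0, eps=0.3))
```

The second case illustrates why the threshold matters for biased targets: the trivial solution already achieves an error equal to the bias, and learning has to improve on it.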

We derive separately the asymptotic properties for $\kappa = 0$ and for $\kappa \neq 0$, for reasons that will become clear in the
following.

We consider first the case $\kappa = 0$. The band of positive stabilities starts at $\lambda_+ = 0$ so that the gap, of width $\sqrt{2c}$, lies
strictly in the region of negative stabilities. As we expect the gap to vanish for $\alpha \to +\infty$, we look for solutions of
the extremum equations with $c \to 0$ and $|\theta| \to +\infty$ (notice that $\theta$ is negative for $\epsilon < 1/2$) with the product $a = \theta\sqrt{2c}$
finite. Introducing these assumptions into (18), we determine $\epsilon$ as a function of $a$:

$$\epsilon = \frac{e^{a}(1-a) - 1}{2\,(\cosh a - a\sinh a - 1)}. \tag{21}$$

The relation between $\alpha$, $\theta$ and $a$ follows from (17) and (21):

$$\frac{1}{\alpha} = \frac{\exp(-\theta^2/2)}{|\theta|^3\sqrt{2\pi}}\left\{\frac{a^2(\sinh a - a)}{\cosh a - a\sinh a - 1}\right\}. \tag{22}$$



$a$ and $\theta$ are increasing functions of $\epsilon$ as expected. For a symmetric distribution of targets ($\epsilon = 1/2$), $a = 0$,
corresponding to a vanishing threshold. Conversely, if all the targets are $+1$ ($\epsilon = 0$), the threshold diverges to $-\infty$.
For finite $\epsilon < 1/2$, the absolute value of the threshold is an increasing function of $\alpha$. From equation (22) we obtain
the development $\theta^2 = 2\ln\alpha + O(\ln\ln\alpha)$. Notice that neglecting $\ln\ln\alpha$ with respect to $\ln\alpha$ is an approximation only
valid for large enough $\alpha$ ($\alpha > 10^{10}$). As was already pointed out in [24], this behaviour cannot be deduced by solving
the equations (17) and (18) numerically.

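The training error $\epsilon_t(\alpha,\epsilon)$ (20) with $\kappa = 0$ in the limit $\alpha \to +\infty$ is then:

$$\epsilon_t(\alpha,\epsilon) \simeq \epsilon - \frac{\exp(-\theta^2/2)}{|\theta|\sqrt{2\pi}}\left\{\frac{\sinh a - a}{\cosh a - a\sinh a - 1}\right\}. \tag{23}$$

Using equations (22) and (23), we deduce:

$$\epsilon_t(\alpha,\epsilon) \simeq \epsilon - \frac{\theta^2}{a^2\,\alpha} \simeq \epsilon - \frac{2\ln\alpha}{a^2(\epsilon)\,\alpha} \tag{24}$$

where $a(\epsilon)$ is the inverse function of $\epsilon(a)$ given by (21).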
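The asymptotic result for vanishing margin can be evaluated numerically once the relation between the bias and the product a (threshold times gap width) is inverted. The sketch below does this by bisection. It is an illustration under stated assumptions: the function names are hypothetical, the search interval for a is an arbitrary choice, and the squared threshold is replaced by its leading term, twice the logarithm of the training set size, which the text says is only accurate for extremely large training sets.

```python
import math

def eps_of_a(a):
    """Bias as a function of a: (e^a (1 - a) - 1) / (2 (cosh a - a sinh a - 1)), a <= 0 for eps <= 1/2."""
    if abs(a) < 1e-4:
        return 0.5 + a / 3.0          # leading terms of the series around a = 0
    num = math.exp(a) * (1.0 - a) - 1.0
    den = 2.0 * (math.cosh(a) - a * math.sinh(a) - 1.0)
    return num / den

def a_of_eps(eps, lo=-50.0, hi=0.0, iters=200):
    """Invert eps_of_a by bisection, using that the bias increases with a."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if eps_of_a(mid) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rs_error_zero_margin(alpha, eps):
    """Leading-order training error: eps - 2 ln(alpha) / (a(eps)^2 alpha)."""
    a = a_of_eps(eps)
    return eps - 2.0 * math.log(alpha) / (a * a * alpha)

print(eps_of_a(-1.0))
print(a_of_eps(0.2))
print(rs_error_zero_margin(1e6, 0.2))
```

The expression is only meaningful for a bias well below 1/2, since a goes to zero and the correction term diverges when the targets become unbiased.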

Consider now the Gardner potential with finite $\kappa$. Although a solution of equations (17) and (18) under the
assumption that $c \to 0$ with finite $\theta$ in the limit $\alpha \to +\infty$ exists, it does not correspond to the correct extremum of
$g$ (13). It is however worth examining. The corresponding value of $\theta$ as a function of $\epsilon$ and $\kappa$ follows from (18),
and the relation between $\alpha$, $\theta$, $\kappa$ and $c$ from (17). We find:

$$\epsilon = \frac{1}{1 + \exp(2\kappa\theta)}, \tag{25}$$

$$\frac{1}{\alpha} = \frac{2\,(2c)^{3/2}(1-\epsilon)}{3\sqrt{2\pi}}\exp\left(-\frac{(\kappa+\theta)^2}{2}\right). \tag{26}$$

As $\sqrt{2c} < \kappa$, the training error given by equation (20) reads:

$$\epsilon_t(\alpha,\epsilon) = \epsilon + (1-2\epsilon)\int_{-\theta}^{+\infty}Dy \tag{27}$$

and is larger than $\epsilon$ for any finite $\theta$. Notice that this (incorrect) solution does not satisfy the condition (7) necessary
for the TLA to converge.

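In fact, the correct training error corresponds to a solution with finite gap ($\sqrt{2c} \to 2\kappa$) and a diverging threshold
($\theta \to -\infty$) in the large $\alpha$ limit. Defining $\delta = 2\kappa - \sqrt{2c}$, and keeping only the leading terms, equations (17), (18) and
(20) for $\kappa > 0$ give:

$$\frac{\epsilon}{1-\epsilon} \simeq -\frac{\exp(-\delta(\theta+\kappa))}{2\kappa(\theta+\kappa)}, \tag{28}$$

$$\frac{1}{\alpha} \simeq \frac{2\kappa(1-\epsilon)}{(\theta+\kappa)^2\sqrt{2\pi}}\exp\left(-\frac{(\theta+\kappa)^2}{2}\right), \tag{29}$$

$$\epsilon_t(\alpha,\epsilon) \simeq \epsilon - \frac{1}{(2\kappa)^2\,\alpha}. \tag{30}$$

The neglected terms are of the order $O(\exp(-2\kappa\sqrt{2\ln\alpha} + \ln\ln\alpha))$, which are only negligible if $\kappa$ is finite. The prefactor
$1/(2\kappa)^2$ in (30), which diverges when $\kappa \to 0$, reflects the existence of the different behaviours for vanishing and for finite
$\kappa$.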

This second solution only exists for bounded potentials. The Perceptron and the AdaTron potentials diverge for
$\lambda \to -\infty$, and the corresponding training errors become larger than $\epsilon$ in the large $\alpha$ limit. Thus, if these learning
algorithms were used to train the hidden perceptrons, the TLA would not converge.

Although the case of unbiased targets (i.e. $\epsilon = 1/2$) is not essential for our study, we include here the corresponding
analytic results for the sake of completeness. In this case, the free energy $g$ (13) is invariant with respect to the
threshold symmetry $\theta \leftrightarrow -\theta$. Thus, $\theta = 0$ is a trivial extremum of $g$. However, as already discussed by West and Saad
in [24], two new solutions breaking the threshold symmetry appear above a given training set size $\alpha_\theta$. The analytical
expression of $\alpha_\theta$ may be deduced under the assumption that the two different solutions appear continuously at $\alpha_\theta$, as
in usual second order phase transitions, through a series expansion of the free energy in powers of $\theta$:

$$g(\alpha,\epsilon,c,\theta) = g(\alpha,\epsilon,c,0) + \frac{\theta^2}{2}\left.\frac{\partial^2 g}{\partial\theta^2}\right|_{\theta=0} + \frac{\theta^4}{24}\left.\frac{\partial^4 g}{\partial\theta^4}\right|_{\theta=0} + \cdots \tag{31}$$

Due to the symmetry, the odd derivatives with respect to $\theta$ vanish. The condition

$$\left.\frac{\partial^2 g}{\partial\theta^2}\right|_{\theta=0} = 0 \;\Longleftrightarrow\; \int_{\kappa-\sqrt{2c}}^{\kappa} y\,(\kappa-y)\,Dy = 0 \tag{32}$$

defines $\sqrt{2c}$ at the transition. The size $\alpha_\theta$ satisfies:

$$\alpha_\theta = \left[\int_{\kappa-\sqrt{2c}}^{\kappa}(\kappa-y)^2\,Dy\right]^{-1} \tag{33}$$

and the two new solutions that appear for $\alpha > \alpha_\theta$ correspond to a threshold $\theta_\pm \sim \pm\sqrt{\alpha - \alpha_\theta}$. Notice that the usual
stability criterion for second order phase transitions, $\partial^4 g/\partial\theta^4 > 0$, cannot be directly applied here because we have
two order parameters. Taking into account the leading corrections to $c$, proportional to $\theta^2$, it is straightforward to
verify that the solutions with finite threshold are stable.

B. Kuhn-Tucker cavity method 

In order to circumvent the RS approximation, we determine the training error $\epsilon_t(\alpha,\epsilon)$ using the Kuhn-Tucker (KT)
cavity method proposed by Gerl and Krey [18], which we generalize here to the case of a perceptron with a threshold
learning a training set with a biased probability of targets given by (5). Contrary to the RS solution, this cavity
method has been shown to overestimate the training error [18]. Consequently, the results allow us to deduce an upper
bound for the number of perceptrons needed by the tilinglike procedure to converge.

The KT cavity method allows us to determine the properties of the perceptron by analyzing self-consistently its
response to the introduction of a new pattern into the training set. It is particularly suited to studying the properties
of the Gardner potential (9) because it is based on the fact that the weights minimizing the corresponding cost
function are a (conveniently normalized) linear combination of the patterns with stability $\kappa$, which are called support
vectors.

Let us assume that the perceptron has learned the training set and that the value of the cost function is $E$. This is
the number of examples with stability smaller than the margin $\kappa$. The support vectors belong to the subset of $\alpha N - E$
remaining examples that do not contribute to the cost. The perceptron's weights may be expressed as follows:

$$\mathbf{J} = \frac{1}{N}\sum_{\mu\in\{\alpha N - E\}} a^\mu\,\tau^\mu\,\mathbf{x}^\mu \tag{34}$$

with $a^\mu > 0$ for $\lambda^\mu = \kappa$, and $a^\mu = 0$ for $\lambda^\mu > \kappa$. These are the so-called Kuhn-Tucker conditions. Defining $a^\mu = 0$ for
examples with $\lambda^\mu < \kappa$, the normalization of the weights imposes:

$$1 = \mathbf{J}\cdot\mathbf{J} = \frac{1}{N}\sum_{\mu=1}^{\alpha N} a^\mu\,\tau^\mu\,\mathbf{x}^\mu\cdot\mathbf{J} = \frac{1}{N}\sum_{\mu=1}^{\alpha N} a^\mu\,(\kappa + \tau^\mu\theta). \tag{35}$$

As usual with cavity methods, we introduce a new example $\mathbf{x}^0$ with target $\tau^0$, drawn respectively with the same
probability densities as the other inputs and targets in the training set. Before any modification, as the pattern
is uncorrelated with the direction $\mathbf{J}$ and its components are assumed to have a gaussian distribution, its projection
onto $\mathbf{J}$ has a gaussian probability. Therefore, the joint probability distribution of the target $\tau^0$ and the stability
$\lambda^0 = \tau^0(\mathbf{x}^0\cdot\mathbf{J} - \theta)$ before learning is:

$$\Pi(\lambda^0,\tau^0) = P(\tau^0)\,\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(\lambda^0+\tau^0\theta)^2}{2}\right) \tag{36}$$

where $P(\tau^0)$ is defined by (5). We assume a single ground state and we calculate the necessary adjustments of the
weights $\mathbf{J}$ in order to obtain self-consistent equations for the cost function as a function of $\alpha$.

If $\lambda^0 > \kappa$, no learning is needed, as the new example does not contribute to the cost. If $\lambda^0 < \kappa$, two different
situations may occur. Either the distance of the new example to the hyperplane is too large and the perceptron is
unable to learn it, or the example is close enough and can be learned. The natural strategy to minimize the cost
function is to include the new example in the subset of support vectors only if $\kappa - \sqrt{2c} < \lambda^0 < \kappa$, where $\sqrt{2c}$ is a
positive quantity which has to be determined self-consistently. Otherwise, the weights are not modified and the new
example is left in the subset of examples contributing to the cost. We are left with the problem of determining the
perturbation on the weights such that examples with $\kappa - \sqrt{2c} < \lambda^0 < \kappa$ become support vectors after learning. As
a first step, this can be obtained by taking $a^0 = \kappa - \lambda^0$. However, this modifies the stabilities of the other support
vectors. The coefficients $a^\mu > 0$ ($\mu \geq 1$) must be corrected by a small amount to compensate for this perturbation.
This correction in turn modifies the stability of the new example 0, and $a^0$ has to be corrected. After a full summation
of the contributions, Gerl and Krey have shown that the correct value of $a^0$ is:

$$a^0 = \frac{\kappa - \lambda^0}{1 - \alpha P(a^\mu > 0)} \tag{37}$$

where $P(a^\mu > 0)$ is the probability that $a^\mu > 0$. This probability is determined assuming that the new example is
equivalent to the others:

$$P(a^\mu > 0) = P(a^0 > 0) = \sum_{\tau^0=\pm 1}\int_{\kappa-\sqrt{2c}}^{\kappa}\Pi(\lambda^0,\tau^0)\,d\lambda^0. \tag{38}$$

Having specified the learning procedure, we are able to determine $\sqrt{2c}$ and $E$ self-consistently. First of all, the
normalization of the weights given by equation (35) may be written as follows:

$$1 = \alpha\sum_{\tau^0=\pm 1}\int_{-\infty}^{+\infty} a^0\,(\kappa + \tau^0\theta)\,\Pi(\lambda^0,\tau^0)\,d\lambda^0 \tag{39}$$

with $a^0$ given by (37) for $\kappa - \sqrt{2c} < \lambda^0 < \kappa$ and $a^0 = 0$ elsewhere. Combining equations (37), (38) and (39), we obtain:

$$1 = \alpha(1-\epsilon)\int_{\kappa-\sqrt{2c}+\theta}^{\kappa+\theta}\left[1 + (\kappa+\theta)(\kappa+\theta-y)\right]Dy + \alpha\epsilon\int_{\kappa-\sqrt{2c}-\theta}^{\kappa-\theta}\left[1 + (\kappa-\theta)(\kappa-\theta-y)\right]Dy. \tag{40}$$

This equation, which determines $\sqrt{2c}$ for a fixed threshold $\theta$, is slightly different from the RS result (17). The cost
function $E$ is determined assuming that it remains unchanged (to order $\sqrt{N}$) upon learning the new example. Thus,
the cost per example reads:
$$\frac{E}{\alpha N} = \sum_{\tau^0=\pm 1}\int_{-\infty}^{\kappa-\sqrt{2c}}\Pi(\lambda^0,\tau^0)\,d\lambda^0 = (1-\epsilon)\int_{-\infty}^{\kappa-\sqrt{2c}+\theta}Dy + \epsilon\int_{-\infty}^{\kappa-\sqrt{2c}-\theta}Dy. \tag{41}$$

Notice that when $\kappa - \sqrt{2c} < 0$, $E/(\alpha N)$ (41) represents the fraction of training errors $\epsilon_t(\alpha,\epsilon)$ and is similar to (20).
The threshold may be optimized in order to minimize the cost function:

$$\frac{dE}{d\theta} = 0. \tag{42}$$

In the following, we solve (40) and (42) in the large $\alpha$ limit. First of all, we consider the case $\kappa = 0$. In this case,
$E/(\alpha N)$ (equation (41)) is the training error $\epsilon_t$. As for the RS calculation, we may assume $\sqrt{2c} \ll |\theta|$ and $a = \theta\sqrt{2c}$
finite. We obtain the following equations:



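$$\frac{\epsilon}{1-\epsilon} = e^{2a}\left(\sqrt{4a^2+1} - 2a\right), \tag{43}$$

$$\frac{1}{\alpha} \simeq \frac{a\,\exp(-\theta^2/2)}{\theta\sqrt{2\pi}}\left[(1-\epsilon)\,e^{a} + \epsilon\,e^{-a}\right], \tag{44}$$

$$\epsilon_t(\alpha,\epsilon) \simeq \epsilon - \frac{F(a)}{\alpha} \simeq \epsilon - \frac{1}{a\,\alpha}\,\frac{1 + 2a - \sqrt{4a^2+1}}{1 + \sqrt{4a^2+1} - 2a}. \tag{45}$$

These results differ from those obtained with the RS calculation (equations (21), (22) and (24)).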
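The cavity result for vanishing margin can be evaluated in the same way as the replica one. The sketch below assumes the relation eps/(1-eps) = exp(2a)(sqrt(4a^2+1) - 2a) between the bias and the product a, and a training error of the form eps - F(a)/alpha with F(a) = (1 + 2a - sqrt(4a^2+1)) / (a (1 + sqrt(4a^2+1) - 2a)). Function names, the bisection interval and the small-a cutoff are hypothetical choices.

```python
import math

def kt_eps_of_a(a):
    """Bias as a function of a in the cavity calculation; increases with a and equals 1/2 at a = 0."""
    q = math.sqrt(4.0 * a * a + 1.0)
    ratio = math.exp(2.0 * a) * (q - 2.0 * a)      # ratio = eps / (1 - eps)
    return ratio / (1.0 + ratio)

def kt_F(a):
    """Coefficient of 1/alpha in the error reduction; tends to 1 when a tends to 0."""
    if abs(a) < 1e-6:
        return 1.0
    q = math.sqrt(4.0 * a * a + 1.0)
    return (1.0 + 2.0 * a - q) / (a * (1.0 + q - 2.0 * a))

def kt_training_error(alpha, eps, lo=-50.0, hi=0.0, iters=200):
    """Invert kt_eps_of_a by bisection, then return eps - F(a)/alpha."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kt_eps_of_a(mid) < eps:
            lo = mid
        else:
            hi = mid
    return eps - kt_F(0.5 * (lo + hi)) / alpha

print(kt_eps_of_a(-1.0), kt_F(-1.0))
print(kt_training_error(1e4, 0.2))
```

Unlike the replica expression, the error reduction here carries no logarithm of the training set size, which is the origin of the different exponent quoted for the cavity method.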

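In the case of finite margin $\kappa$, the pertinent assumptions in the large $\alpha$ limit are $\sqrt{2c} \to 2\kappa$ with $\delta = 2\kappa - \sqrt{2c}$ and
$\theta \to -\infty$. With these, here again $E/(\alpha N)$ (41) is the training error, and we get:

$$\frac{\epsilon}{1-\epsilon} \simeq \frac{4\exp(-\delta(\theta+\kappa))}{(\theta+\kappa)^2}, \tag{46}$$

$$\frac{1}{\alpha} \simeq \frac{8\kappa(1-\epsilon)}{(\theta+\kappa)^2\sqrt{2\pi}}\exp\left(-\frac{(\theta+\kappa)^2}{2}\right), \tag{47}$$

$$\epsilon_t(\alpha,\epsilon) \simeq \epsilon + \frac{1}{2\kappa\alpha(\theta+\kappa)} \simeq \epsilon - \frac{1}{2\kappa\alpha\sqrt{2\ln\alpha}}. \tag{48}$$

It is worth pointing out that even within the KT cavity method, the training error satisfies the convergence
conditions (7) and (8).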

The main conclusion of this section is that the TLA converges provided that the hidden perceptrons are trained
through the minimization of a cost function with a bounded potential. The Gardner potential (9) satisfies this
constraint. The asymptotic behaviours of the training error in the large $\alpha$ limit, calculated for $\kappa = 0$ and $\kappa \neq 0$ using
two different approaches, are used in the following sections to characterize the storage capacity of the constructive
algorithm.



V. NUMBER OF HIDDEN PERCEPTRONS IN THE LARGE $\alpha$ LIMIT 

We assume that the probability distribution of the targets $\tau^\mu$ in the training set is symmetric, given by (5) with 
$\epsilon = 1/2$, so that the training error of the first perceptron is $\epsilon_t^1 = \epsilon_t(\alpha, 1/2)$. Considering iteratively the relationship 
between the training errors of two consecutive perceptrons (Q) yields: 

$$f_\alpha^{\circ k}(1/2) = \underbrace{f_\alpha \circ \cdots \circ f_\alpha}_{k\ \text{times}}(1/2) = 0 \tag{49}$$

where $f_\alpha(\epsilon)$ stands for $\epsilon_t(\alpha,\epsilon)$, the symbol $\circ$ for the composition of functions, and $k$ is the number of perceptrons 
necessary for convergence of the TLA. 

FIG. 1. Evolution of the successive training errors. The full curve corresponds to the training error $\epsilon_t(\alpha,\epsilon)$ of a perceptron 
with biased targets. The first training error $\epsilon_t^1$ is given by $\epsilon_t(\alpha, 1/2)$ and the following ones by the relation $\epsilon_t^{i+1} = \epsilon_t(\alpha, \epsilon_t^i)$. In 
this case, the learning algorithm converges with six perceptrons. 

The evolution of the training errors of the successive perceptrons is schematically represented in Figure 1 for an 
arbitrary function $\epsilon_t(\alpha,\epsilon)$, where the tilinglike algorithm is shown to converge in six steps, i.e. $k = 6$. 
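The iteration sketched in the figure can be mimicked numerically. The following is only a minimal illustration: the learning curve `toy_curve` is a hypothetical stand-in (a fixed error reduction per unit), not the perceptron learning curve computed in this paper, and the names `count_perceptrons`, `tol` and `max_units` are assumptions introduced for the example.

```python
from typing import Callable


def count_perceptrons(f: Callable[[float], float], eps0: float = 0.5,
                      tol: float = 1e-12, max_units: int = 100_000) -> int:
    """Return the number k of successive applications of f, starting from
    eps0, needed to bring the training error down to (numerically) zero."""
    eps = eps0
    for k in range(1, max_units + 1):
        new_eps = f(eps)
        if new_eps >= eps:
            # The error no longer decreases: the construction cannot converge.
            raise ValueError("training error did not decrease")
        eps = new_eps
        if eps <= tol:
            return k
    raise RuntimeError("no convergence within max_units")


def toy_curve(eps: float) -> float:
    """Hypothetical learning curve: a fixed reduction of 0.09 per unit."""
    return max(eps - 0.09, 0.0)


n_units = count_perceptrons(toy_curve)
print(n_units)  # prints 6
```

With this toy curve the errors run through 0.41, 0.32, 0.23, 0.14, 0.05 and 0, so six units are needed, as in the schematic figure. The `ValueError` branch corresponds to a learning curve for which the training error is not smaller than the bias.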

We are interested in the limit of large training set sizes ($\alpha \to +\infty$). In this limit, the training error $\epsilon_t(\alpha,\epsilon)$ is close 
to $\epsilon$: 

$$\epsilon_t(\alpha,\epsilon) = \epsilon - h(\alpha,\epsilon) \tag{50}$$



with $h(\alpha,\epsilon)$ a function that vanishes in the limit $\alpha \to +\infty$. Notice that those cost functions that do not satisfy 
condition (pj) for all $\alpha$ are useless in this limit, since the error reduction at each step, $\epsilon_t^{i+1} - \epsilon_t^i = -h(\alpha,\epsilon_t^i)$, vanishes at 
some finite value of $\alpha$. For larger values of $\alpha$ it becomes positive, and the TLA does not converge. In the preceding 
section we showed that the Gardner potential, both with vanishing and with finite margin $\kappa$, has $h(\alpha,\epsilon) > 0$ (see equations 
(24) and (30)) and satisfies condition (Q). 

As $h(\alpha,\epsilon)$ vanishes in the limit $\alpha \to +\infty$, we can guess that the number $k(\alpha)$ diverges. In this limit we can introduce 
the continuum approximation, replacing $i/k$ by the real-valued variable $x$. Then, the error reduction at each step is 
given by: 

$$\epsilon_t^{i+1} - \epsilon_t^{i} \simeq \frac{1}{k}\frac{d\epsilon}{dx} = -h(\alpha,\epsilon). \tag{51}$$



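After integration of both sides of the equation $d\epsilon/h(\alpha,\epsilon) = -k\,dx$ at constant $\alpha$, from $\epsilon = 1/2$ and $x = 0$ to $\epsilon = 0$ 
and $x = 1$, we obtain: 

$$k(\alpha) = \int_0^{1/2} \frac{d\epsilon}{h(\alpha,\epsilon)} = \int_0^{1/2} \frac{d\epsilon}{\epsilon - \epsilon_t(\alpha,\epsilon)}. \tag{52}$$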



Equation (52) gives the asymptotic behaviour of the number of hidden perceptrons necessary for the tilinglike algorithm 
to converge in the limit $\alpha \to +\infty$. It depends on the cost function used to train the perceptrons through $\epsilon_t(\alpha,\epsilon)$. 
The storage capacity $\alpha_c(k)$ of the TLA is then obtained through the inversion of $k(\alpha)$. 
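The quality of the continuum approximation can be checked on a toy example by comparing the number of discrete steps with the integral of $d\epsilon/h$ between 0 and 1/2. In the sketch below, the error reduction `h_toy` is a hypothetical smooth positive function chosen only because its integral is known in closed form; it is not one of the functions $h(\alpha,\epsilon)$ derived in this paper, and all function names are assumptions of the example.

```python
import math


def discrete_count(h, eps0=0.5):
    """Number of steps eps <- eps - h(eps) needed to reach eps <= 0."""
    eps, k = eps0, 0
    while eps > 0.0:
        eps -= h(eps)
        k += 1
    return k


def continuum_count(h, eps0=0.5, n=100_000):
    """Midpoint-rule estimate of the integral of d(eps)/h(eps) over [0, eps0]."""
    d = eps0 / n
    return sum(d / h((i + 0.5) * d) for i in range(n))


def h_toy(eps, scale=1e-3):
    """Hypothetical error reduction: small, smooth and strictly positive."""
    return scale * (1.0 + eps)


k_discrete = discrete_count(h_toy)
k_continuum = continuum_count(h_toy)
print(k_discrete, k_continuum)
```

For this choice the integral equals $\ln(3/2)/10^{-3} \approx 405.5$, and the discrete iteration stops within one step of that value. The agreement is expected to improve as the error reduction per step becomes smaller.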

Hereafter we consider the case where the hidden perceptrons are trained with the Gardner cost function, using the 
results of the preceding section. 

We determine first the number of hidden units obtained when the perceptrons minimize the number of training 
errors, that is, the Gardner cost function with $\kappa = 0$. Inserting into (52) the result (24) obtained within the RS 
approximation, we obtain: 

$$k^{RS}(\alpha) = \int_0^{1/2} \frac{d\epsilon}{\epsilon - \epsilon_t(\alpha,\epsilon)} \simeq \frac{\alpha}{2\ln\alpha} \int_0^{1/2} a^2(\epsilon)\, d\epsilon \simeq 0.475\, \frac{\alpha}{\ln\alpha} \tag{53}$$



where $a(\epsilon)$ is given by (21). From this result, we deduce the storage capacity in the limit of a large number of hidden 
perceptrons: 

$$\alpha_c^{RS}(k) \simeq 2.11\, k \ln k. \tag{54}$$

Surprisingly, the capacity of the TLA scales with k like the upper bound for the parity machine with the same number 
of hidden units, and only the prefactor is overestimated. 
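The inversion leading from (53) to (54) can be examined numerically. The sketch below takes the asymptotic form $k \simeq 0.475\,\alpha/\ln\alpha$ at face value for all $\alpha > e$, which is an assumption of the example since the formula is only valid at large $\alpha$, and inverts it by bisection. The function names and the bracketing interval are hypothetical choices.

```python
import math


def k_rs(alpha):
    """Asymptotic RS estimate (53): k ~ 0.475 * alpha / ln(alpha)."""
    return 0.475 * alpha / math.log(alpha)


def alpha_rs(k, lo=math.e, hi=1e30, iters=200):
    """Invert k_rs by bisection on a logarithmic scale.

    k_rs is increasing for alpha > e, so the root is unique in [lo, hi]."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if k_rs(mid) < k:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def prefactor(k):
    """Ratio alpha(k) / (k ln k), to be compared with 1/0.475 ~ 2.11."""
    return alpha_rs(k) / (k * math.log(k))


ratios = [prefactor(k) for k in (1e2, 1e4, 1e8)]
print(ratios)
```

The ratio decreases towards $1/0.475 \approx 2.11$ as $k$ grows, but only logarithmically slowly, which suggests that the leading behaviour (54) should be read as a statement about very large $k$.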

Using the result (45) obtained with the KT cavity method, which overestimates the perceptron's training error, we 
get: 

$$k^{KT}(\alpha) \simeq \alpha \int_0^{1/2} \frac{d\epsilon}{F(a(\epsilon))} \simeq 1.082\, \alpha \tag{55}$$

where $F(a)$ is defined in (45) and $a(\epsilon)$ is given by (43). The corresponding storage capacity is: 

$$\alpha_c^{KT}(k) \simeq 0.924\, k. \tag{56}$$

We find that $\alpha_c^{KT} < \alpha_c^{RS}$, as expected. The behaviour of the storage capacity obtained with the Kuhn-Tucker 
cavity method is linear in $k$. This suggests that including replica symmetry breaking in the replica calculation may 
modify the $k \ln k$ behaviour to one proportional to $k (\ln k)^\nu$ with $0 < \nu < 1$. However, as the actual training error of 
the perceptrons seems closer to the RS solution than to the Kuhn-Tucker cavity result [18], we expect $\nu$ to be close 
to 1. 

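In the following we consider the parity machine obtained when the perceptrons are trained using the Gardner cost 
function with a finite margin $\kappa$. We get: 

$$k^{RS}(\alpha,\kappa) \simeq 2\kappa^2\alpha, \quad \text{and} \quad k^{KT}(\alpha,\kappa) \simeq \kappa\,\alpha\sqrt{2\ln\alpha}. \tag{57}$$

After inversion of (57), the capacities deduced within the two approximations are: 

$$\alpha_c^{RS}(k,\kappa) \simeq \frac{k}{2\kappa^2}, \quad \text{and} \quad \alpha_c^{KT}(k,\kappa) \simeq \frac{k}{\kappa\sqrt{2\ln k}} \tag{58}$$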

respectively. Here again, the behaviours of $k(\alpha)$ and $\alpha_c(k)$ obtained with the RS approximation and with the Kuhn-Tucker 
cavity method differ. In both cases, the value of $\kappa$ only affects the prefactor but not the scaling with $\alpha$ or $k$. 
Consistently, the prefactor of $\alpha_c$ diverges for $\kappa \to 0$, where the expressions (57) and (58) have to be replaced by (54) 
and (56) respectively, as the functional dependence of the storage capacity on $k$ is different for $\kappa = 0$. 
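The inversion for finite margin can be checked in the same spirit. The sketch below assumes the finite-margin forms $k \simeq 2\kappa^2\alpha$ (RS) and $k \simeq \kappa\,\alpha\sqrt{2\ln\alpha}$ (KT), treats them as exact for the purpose of illustration, and compares a numerical inversion of the KT expression with the leading-order capacity $k/(\kappa\sqrt{2\ln k})$. The values $\kappa = 0.5$ and $k = 10^6$, as well as all function names, are arbitrary choices made for the example.

```python
import math


def alpha_rs_margin(k, kappa):
    """Exact inverse of the RS form k = 2 kappa^2 alpha."""
    return k / (2.0 * kappa ** 2)


def k_kt_margin(alpha, kappa):
    """KT form: k = kappa * alpha * sqrt(2 ln alpha)."""
    return kappa * alpha * math.sqrt(2.0 * math.log(alpha))


def alpha_kt_margin(k, kappa, lo=math.e, hi=1e30, iters=200):
    """Numerical inverse of k_kt_margin (increasing in alpha) by bisection."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if k_kt_margin(mid, kappa) < k:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def alpha_kt_asymptotic(k, kappa):
    """Leading-order inverse: k / (kappa * sqrt(2 ln k))."""
    return k / (kappa * math.sqrt(2.0 * math.log(k)))


kappa, k = 0.5, 1e6
numeric = alpha_kt_margin(k, kappa)
leading = alpha_kt_asymptotic(k, kappa)
print(numeric / leading, alpha_rs_margin(k, kappa) / numeric)
```

For these values the leading-order KT capacity lies within a few percent of the numerical inverse, the difference coming from replacing $\ln\alpha$ by $\ln k$ under the square root, and the RS capacity is larger than the KT one.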

Imposing a finite margin dramatically decreases the capacity of the TLA. More precisely, the exponents $\nu$ of 
the logarithmic factor differ, depending on the approximations (RS and KT cavity method), in both $\kappa$-regimes 
($\nu_{RS}(\kappa = 0) = 1$, $\nu_{RS}(\kappa > 0) = 0$, $\nu_{KT}(\kappa = 0) = 0$ and $\nu_{KT}(\kappa > 0) = -1/2$). 
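To make the comparison between the two margin regimes concrete, the four leading-order capacities can be evaluated side by side. The sketch below simply codes the asymptotic laws $2.11\,k\ln k$, $0.924\,k$, $k/(2\kappa^2)$ and $k/(\kappa\sqrt{2\ln k})$; the function name `capacity`, the value $k = 1000$ and the margin $\kappa = 0.5$ are hypothetical choices, and the finite-margin laws are not expected to hold for small $\kappa$, where their prefactors diverge.

```python
import math


def capacity(k, kappa=0.0, method="RS"):
    """Leading-order TLA capacity for large k.

    kappa == 0: 2.11 k ln k (RS) or 0.924 k (KT).
    kappa  > 0: k / (2 kappa^2) (RS) or k / (kappa sqrt(2 ln k)) (KT).
    """
    if method not in ("RS", "KT"):
        raise ValueError("method must be 'RS' or 'KT'")
    if kappa == 0.0:
        return 2.11 * k * math.log(k) if method == "RS" else 0.924 * k
    if method == "RS":
        return k / (2.0 * kappa ** 2)
    return k / (kappa * math.sqrt(2.0 * math.log(k)))


k = 1000
table = {(kappa, m): capacity(k, kappa, m)
         for kappa in (0.0, 0.5) for m in ("RS", "KT")}
for key, value in table.items():
    print(key, round(value, 1))
```

With these illustrative values, each approximation gives a smaller capacity at $\kappa = 0.5$ than at $\kappa = 0$, and the KT estimate stays below the RS estimate in both regimes.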

It is interesting to compare the exponents determined analytically within the RS approximation to those obtained 
by West and Saad [15] through a numerical iteration over the successive perceptrons' training errors. For $\kappa = 0$, they 
obtain $\nu$ close to 1 ($n_e = 1.070$ and $1.049$, and $n_i = 1.079$ and $1.062$, for $k = 1000$ and $4000$ respectively; table 3 
in [15]), in very good agreement with our result $\nu_{RS}(\kappa = 0) = 1$. In the case of finite $\kappa$, West and Saad find that the 
exponent decreases with increasing $\kappa$ (figure 13 left in [15]). Our result (58) shows that the exponent does not depend 
on $\kappa$; only the prefactor does. The dependence found numerically is probably due to higher order corrections, which 
behave like $O(\exp(-2\kappa\sqrt{2\ln\alpha} + \ln\ln\alpha))$. These terms, which are less and less negligible when approaching $\kappa = 0$, 
hinder the determination of the power-law exponent in the asymptotic regime $\alpha \to +\infty$. Remarkably, the RS and 
KT exponents $\nu_{RS}$ and $\nu_{KT}$ provide correct upper and lower bounds for the exponent obtained numerically within 
the one-step replica symmetry breaking approximation (figure 13 right in [15]). 



VI. CONCLUSION 



We determined analytically the typical number of hidden units needed by a simple constructive procedure, the 
Tilinglike Learning Algorithm proposed in [10], to build a parity machine. The number of hidden units depends 
strongly on the asymptotic properties of the learning algorithm used to train them. 

We showed that the cost function minimized by the hidden perceptrons has to be bounded. This rules out, in 
particular, the Perceptron and the AdaTron learning algorithms, as with these the training error cannot decrease 
beyond a finite value that depends on the training set size and on the bias of the target distribution. This is 
so because the hidden perceptrons have to learn highly biased output distributions. In the asymptotic regime, large 
thresholds are needed to minimize the training error as, loosely speaking, such solutions allow most patterns of the 
majority class to be classified correctly. In such solutions, a non-negligible fraction of patterns have large negative stabilities. 
If the cost function is unbounded for $\lambda \to -\infty$, it favours solutions with small thresholds, which have large training 
errors. With bounded potentials, like the counting functions used in the Gardner cost function, solutions with large 
thresholds exist. 

We deduced the properties of a perceptron with threshold, learning targets drawn with a biased distribution, 
trained with the Gardner cost function with and without margin. In particular, solutions such that the training 
error is smaller than the bias always exist. This is a necessary condition for the TLA to converge. The asymptotic 
behaviour of the learning curves $\epsilon_t(\alpha,\epsilon)$ was determined through a replica calculation assuming replica symmetry, 
and also using the Kuhn-Tucker cavity method. The former approximation underestimates the training error, while 
the latter overestimates it. The main results are the expressions (24), (30), (45) and (48) relating the training error 
of the perceptron $\epsilon_t(\alpha,\epsilon)$ to the bias $\epsilon$ of the target distribution. Closer inspection of equations (24) and (30) shows 
that the error reduction $\epsilon_t(\alpha,\epsilon) - \epsilon$ at large $\alpha$ is larger in magnitude if $\kappa = 0$ than for $\kappa > 0$. 

These results allow us to find analytically the number of units $k(\alpha)$ needed by the constructive procedure to converge 
in the large $\alpha$ limit. As expected, the smallest $k(\alpha)$ is obtained when the hidden perceptrons minimize their training 
errors, which corresponds to the Gardner cost function with $\kappa = 0$. Nevertheless, it is also worth studying the case 
with $\kappa > 0$, which is interesting in noisy applications. The storage capacity $\alpha_c(k)$ of the TLA is obtained through the 
inversion of $k(\alpha)$. Our results have been obtained under the simplifying assumption that the targets the successive 
perceptrons have to learn are uncorrelated. This hypothesis has been shown to be a good approximation fljfl in the 
limit of large training sets considered here. 

In the limit of large $k$ we find $\alpha_c^{RS}(k) \simeq 2.11\, k \ln k$ within the RS approximation. It is interesting to compare this 
algorithm-dependent storage capacity to the storage capacity of a parity machine with the same number of hidden 
perceptrons. The latter is independent of the learning algorithm. Geometric arguments [13] and a replica calculation 
where the permutation symmetry among hidden units has to be broken [14] both lead to $\alpha_c \simeq k \ln k / \ln 2$. It is 
surprising that, although we disregarded the correlations between perceptrons and assumed replica symmetry, which 
both lead to an overestimation of the storage capacity, we find the same leading behaviour. Only the prefactor is 
overestimated. In fact, the permutation symmetry only arises when the perceptrons are trained simultaneously. As it 
is absent in the case of the incremental construction, the consequence of the RS approximation is less dramatic than 
in [14]. 

As the Kuhn-Tucker cavity method provides an upper bound to the perceptron's training error, it allows us to determine 
a lower bound for the TLA storage capacity. This bound scales linearly with the number of hidden units, suggesting 
that a calculation including full replica symmetry breaking may change the power law of the logarithmic factor. We 
expect that $\alpha_c \sim k (\ln k)^\nu$ with $0 < \nu < 1$. 